61 - HPC Café on February 6, 2024: Efficient Data Handling and Data Formats [ID:51532]

So, today we're going to talk about efficient data handling and data formats.

First, I will talk about the basics of our infrastructure and what we consider the best way to use our storage systems.

And then Hadrian will later talk about more advanced use cases and his experiences with WebDataset and the usage of our systems.

So, first of all, we have several file systems here at NHR@FAU.

And why do we have several file systems?

Because they are used for different purposes.

So we have, for example, home and HPC Vault, which are NFS file systems with backup and snapshots.

Whereas work, which is also an NFS file system, has no backup and no snapshots.

This also allows us to provide you with much more storage capacity on work, for example.

These three file systems are shared on all the clusters and all the nodes.

And we also have other file systems which are not generally available, for example, $FASTTMP, which is a high-performance parallel I/O file system that is only available on Fritz.

And last, we have $TMPDIR, and $TMPDIR is very special because it is a node-local, job-specific directory.

So what does this mean?

This means that whenever your job starts, $TMPDIR becomes available.

It's a directory created on the node.

And if you have a job with two nodes, then each node has a different directory.

And this directory, as I said, is created when the job starts and is deleted when the job ends.

And depending on the cluster you're on, this directory is either on a node-local SSD, if the node has one; if the node doesn't have an SSD, then it's located in a RAM disk.

And on clusters where we have job sharing on the nodes, so where multiple jobs can run concurrently on a node, the capacity of the SSD that $TMPDIR resides on is shared across the node and among the users.

Interestingly, this has not caused us any problems so far.
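
As a minimal sketch of this usage pattern in Python, assuming the $TMPDIR and $WORK environment variables that batch jobs provide, and with hypothetical archive and result file names:

```python
# Minimal sketch: stage data into the node-local $TMPDIR at job start and
# copy results back before the job ends, because $TMPDIR is deleted then.
# The archive and result file names are hypothetical.
import os
import shutil

tmpdir = os.environ["TMPDIR"]                   # node-local, job-specific
dataset = os.path.expandvars("$WORK/cats.tar")  # hypothetical archive on work

# One large, contiguous copy instead of many small-file accesses later on.
local_dataset = shutil.copy(dataset, tmpdir)

# ... train or process using local_dataset on the fast local disk ...

# $TMPDIR vanishes with the job: copy results back to a shared file system.
shutil.copy(os.path.join(tmpdir, "results.txt"),
            os.path.expandvars("$WORK/results.txt"))
```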

Okay.

So why am I telling you all this?

Here you see a little sketch of our infrastructure. It's not accurate, it's just here to get the point across.

So on the right side, we have all the clusters.

On the left side, we have the file systems: home, HPC Vault, and work.

And as I said, the three file systems are available on all the clusters.

This also means that all jobs and all users share these file systems.

So whenever there is high load on one of the file systems, everybody who is currently using that file system experiences this load.

You will see this either when you're working interactively and your command prompt returns very slowly, or when people complain that installing their pip environment or the like becomes pretty slow.

This is typically due to high load on one of these file systems.

Okay.

Also, since these file servers are shared among all users, the I/O bandwidth and I/O latency you get varies with how many users are currently using the file system.

Okay.

What can we do about this?


So first of all, our NFS file servers perform well if you read or write large files, because then you have contiguous access, and this is fine.

However, what they don't like that much is having a lot of files lying around, say 100,000 or 1 million pictures of cats as a data set stored on work.

This is not very good, especially if you repeatedly cycle through them during training.

If you're training on these pictures, you typically open and close every one of these files.

This really creates a lot of pressure on the file servers.
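
To make this concrete, here is a minimal sketch using only the Python standard library, with hypothetical directory and archive names: the many small files are packed once into a single tar archive, and training then reads that one large file sequentially instead of opening and closing a million individual files.

```python
# Minimal sketch: pack a directory of many small files into one tar archive
# (a one-time cost), then read the samples sequentially from that archive.
# Directory and archive names are hypothetical.
import tarfile
from pathlib import Path

src = Path("cat_pictures")          # directory with many small JPEG files
archive = Path("cat_pictures.tar")

# One-time packing: afterwards the file server sees one large file.
with tarfile.open(archive, "w") as tar:
    for f in sorted(src.glob("*.jpg")):
        tar.add(f, arcname=f.name)

# Sequential reading: one open/close for the whole data set instead of
# one per image.
with tarfile.open(archive, "r") as tar:
    for member in tar:
        data = tar.extractfile(member).read()
        # ... decode and process the image bytes ...
```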

Also bad for these file servers are metadata operations.

Metadata operations are operations where you access attributes of a file, like its modification time, or list the contents of a directory.

So if you do an ls in your data set directory with the million cat files, for example, this is really painful.

And what creates even greater load is random access: jumping around in a file, going back and forth.

This is also not very good.

So what can we do about this?
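
One answer, which the second part of this talk explores with WebDataset, is to pack the samples into tar shards and stream them sequentially. A minimal sketch, assuming the WebDataset library is installed and using hypothetical shard names:

```python
# Minimal sketch (hypothetical shard names): stream samples from tar shards
# with the WebDataset library; each shard is read sequentially, which avoids
# per-file opens and random access on the file server.
import webdataset as wds

dataset = (
    wds.WebDataset("cats-{000000..000009}.tar")  # ten tar shards, brace-expanded
    .decode("pil")                               # decode images to PIL objects
    .to_tuple("jpg", "cls")                      # yield (image, label) pairs
)

for image, label in dataset:
    ...  # feed the sample into the training loop
```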

Part of a video series:
Part of a chapter: HPC Café

Accessible via: Open access

Duration: 00:35:40 min

Recording date: 2024-02-06

Uploaded on: 2024-02-07 10:16:03

Language: en-US

Speakers: Markus Wittmann (NHR@FAU) and Hadrian Reynaud (AIBE)

Abstract:

This HPC Café provides a brief overview of our storage infrastructure, the available data formats for storing ML/AI datasets, and how to use them efficiently on our systems. Furthermore, Hadrian Reynaud (AIBE) presents his experiences with WebDataset.

Slides: Part 1, Part 2

Material from past events is available at: https://hpc.fau.de/teaching/hpc-cafe/
